Learning Classifiers for Assigning Protein Sequences to Gene Ontology Functional Families
نویسندگان
چکیده
Assigning putative functions to novel proteins and the discovery of sequence correlates of protein function are important challenges in bioinformatics. In this paper, we explore several machine learning approaches to data-driven construction of classifiers for assigning protein sequences to appropriate Gene Ontology (GO) function families using a class conditional probabilistic representation of amino acid sequences. Specifically, we represent protein sequences using class conditional probability distribution of amino acids (amino acid composition) or short (k-letter) subsequences (k-grams) of amino acids. We compare a model (NB k-grams) that ignores the statistical dependencies among overlapping k-grams with an alternative, NB(k), that uses an undirected probabilistic graphical model that captures the relevant dependencies. These two methods require only one pass through the training data during the learning phase, making them especially attractive in settings where there is a need to update the classifiers as new training data become available. We also explore a support vector machine (SVM) classifier, SVM k-grams, trained on the k-gram class conditional probability distributions of sequences. We report the performance of the resulting classifiers on three data sets of functional families from the Gene Ontology (GO) database. Our results show that NB(k) classifier outperforms NB k-grams in terms of accuracy of classification (as measured by cross-validation) by a few percentage points. SVM k-grams outperforms NB(k) in the majority of test cases. These results suggest the possibility of developing fully automated and computationally efficient approaches to construction of classifiers based on undirected graphical models of overlapping k-grams that can be easily updated as additional training data become available. Our results also show that further gains in accuracy of the classifiers are achievable (at the expense of increased computational demands and hence greater difficulty of frequent updates to the classifier as new training data become available) using SVM k-grams.
منابع مشابه
Learning Classifiers for Assigning Protein Sequences to Gene Ontology Functional Families: Combining of Function Annotation Using Sequence Homology With that Based on Amino Acid k-gram Composition Yields More Accurate Classifiers Than Either of the Individual Approaches
Background
متن کاملPrediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks
Background: Prediction of the protein localization is among the most important issues in the bioinformatics that is used for the prediction of the proteins in the cells and organelles such as mitochondria. In this study, several machine learning algorithms are applied for the prediction of the intracellular protein locations. These algorithms use the features extracted from pro...
متن کاملData-Driven Generation of Decision Trees for Motif-Based Assignment of Protein Sequences to Functional Families
This paper describes an approach to data-driven discovery of sequence motif-based models in the form of decision trees for assigning protein sequences to functional families. Unlike approaches that try to classify protein sequences based on presence of a single motif, this method is able to capture regularities that can be described in terms of presence or absence of arbitrary combinations of m...
متن کاملAutomated data-driven discovery of motif-based protein function classifiers
AUTOMATED DATA-DRIVEN DISCOVERY OF MOTIF-BASED PROTEIN FUNCTION CLASSIFIERS Xiangyun Wang, Diane Schroeder, Drena Dobbs, and Vasant Honavar Artificial Intelligence Laboratory Department of Computer Science and Graduate Program in Bioinformatics and Computational Biology Iowa State University Ames, IA 50011, USA www.cs.iastate.edu/~honavar/aigroup.html [email protected] ABSTRACT This paper ...
متن کاملGENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION
This paper considers the generation of some interpretable fuzzy rules for assigning an amino acid sequence into the appropriate protein superfamily. Since the main objective of this classifier is the interpretability of rules, we have used the distribution of amino acids in the sequences of proteins as features. These features are the occurrence probabilities of six exchange groups in the seque...
متن کامل